61 research outputs found

    Wavelet-filtering of symbolic music representations for folk tune segmentation and classification

    The aim of this study is to evaluate a machine-learning method in which symbolic representations of folk songs are segmented and classified into tune families with Haar-wavelet filtering. The method is compared with a previously proposed Gestalt-based method. Melodies are represented as discrete symbolic pitch-time signals. We apply the continuous wavelet transform (CWT) with the Haar wavelet at specific scales, obtaining filtered versions of melodies that emphasize their information at particular time-scales. We use the filtered signal for representation and segmentation, taking local maxima of the wavelet coefficients to indicate local boundaries, and classify segments by means of k-nearest neighbours based on standard vector metrics (Euclidean, city-block). We compare the results to a Gestalt-based segmentation method and to metrics applied directly to the pitch signal. We found that wavelet-based segmentation and wavelet-filtering of the pitch signal lead to better classification accuracy in cross-validated evaluation when the time-scale and other parameters are optimized.
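    The pipeline the abstract describes can be illustrated compactly. The following is a minimal sketch, not the authors' implementation: it filters a toy pitch-time signal with a Haar continuous wavelet transform at one scale and takes local maxima of the absolute coefficients as candidate segment boundaries. The melody, the scale value, and all function names are illustrative assumptions.

```python
# Minimal sketch of Haar-CWT filtering and boundary detection on a
# discrete pitch-time signal; toy data, not the authors' code.
import numpy as np

def haar_cwt(pitch, scale):
    """Haar-wavelet CWT of a 1-D pitch signal at one scale (in samples)."""
    half = scale // 2
    # Haar wavelet: +1 on the first half of its support, -1 on the second.
    kernel = np.concatenate([np.ones(half), -np.ones(half)]) / np.sqrt(scale)
    # mode="same" keeps the coefficients aligned with the input signal.
    return np.convolve(pitch, kernel, mode="same")

def local_maxima(coeffs):
    """Indices of strict local maxima of |coeffs| (candidate boundaries)."""
    a = np.abs(coeffs)
    return np.flatnonzero((a[1:-1] > a[:-2]) & (a[1:-1] > a[2:])) + 1

# Toy melody: MIDI pitches held over time, a crude pitch-time signal.
melody = np.repeat([60, 62, 64, 60, 67, 65, 64, 62], 4).astype(float)
coeffs = haar_cwt(melody, scale=8)        # time-scale to be optimized
boundaries = local_maxima(coeffs)
segments = np.split(coeffs, boundaries)   # wavelet-filtered segments
print(boundaries, [len(s) for s in segments])
```

    The resulting segments would then be brought to a common length and classified with k-nearest neighbours under Euclidean or city-block distance, with the time-scale treated as a parameter to optimize, as the abstract describes.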

    A Wavelet-Based Approach to Pattern Discovery in Melodies


    Learning Speech Emotion Representations in the Quaternion Domain

    The modeling of human emotion expression in speech signals is an important yet challenging task. The high resource demand of speech emotion recognition models and the general scarcity of emotion-labelled data are obstacles to the development and application of effective solutions in this field. In this paper, we present an approach to jointly circumvent these difficulties. Our method, named RH-emo, is a novel semi-supervised architecture aimed at extracting quaternion embeddings from real-valued monaural spectrograms, enabling the use of quaternion-valued networks for speech emotion recognition tasks. RH-emo is a hybrid real/quaternion autoencoder network that consists of a real-valued encoder in parallel with a real-valued emotion classifier and a quaternion-valued decoder. On the one hand, the classifier makes it possible to optimize each latent axis of the embeddings for the classification of a specific emotion-related characteristic: valence, arousal, dominance, and overall emotion. On the other hand, the quaternion reconstruction enables the latent dimension to develop the intra-channel correlations that are required for an effective representation as a quaternion entity. We test our approach on speech emotion recognition tasks using four popular datasets: IEMOCAP, RAVDESS, EmoDB, and TESS, comparing the performance of three well-established real-valued CNN architectures (AlexNet, ResNet-50, VGG) with that of their quaternion-valued equivalents fed with the embeddings created with RH-emo. We obtain a consistent improvement in test accuracy for all datasets, while drastically reducing the models' resource demands. Moreover, we performed additional experiments and ablation studies that confirm the effectiveness of our approach. The RH-emo repository is available at: https://github.com/ispamm/rhemo. (Paper submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing.)
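    Read as an architecture, the abstract suggests a shape that can be sketched in a few lines. The following is a hedged PyTorch sketch, not the authors' implementation (their code is at https://github.com/ispamm/rhemo): a real-valued encoder yields a 4-channel latent read as one quaternion feature map, one classifier head per latent axis covers valence, arousal, dominance, and overall emotion, and a quaternion-valued layer built from the Hamilton product of four real convolutions stands in for the decoder. All layer sizes, head shapes, and names are illustrative assumptions.

```python
# Hedged sketch of an RH-emo-like hybrid real/quaternion autoencoder.
import torch
import torch.nn as nn

class QuaternionConv(nn.Module):
    """Quaternion convolution: four real kernels combined by the Hamilton product."""
    def __init__(self, in_q, out_q, k=3):
        super().__init__()
        conv = lambda: nn.Conv2d(in_q, out_q, k, padding=k // 2, bias=False)
        self.r, self.i, self.j, self.k = conv(), conv(), conv(), conv()

    def forward(self, x):
        xr, xi, xj, xk = x.chunk(4, dim=1)  # quaternion components
        r = self.r(xr) - self.i(xi) - self.j(xj) - self.k(xk)
        i = self.r(xi) + self.i(xr) + self.j(xk) - self.k(xj)
        j = self.r(xj) - self.i(xk) + self.j(xr) + self.k(xi)
        k = self.r(xk) + self.i(xj) - self.j(xi) + self.k(xr)
        return torch.cat([r, i, j, k], dim=1)

class RHEmoSketch(nn.Module):
    def __init__(self, n_emotions=4):
        super().__init__()
        # Real-valued encoder: 1-channel spectrogram -> 4-channel latent (r,i,j,k).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 4, 3, padding=1),
        )
        # One head per latent axis: valence, arousal, dominance, overall emotion.
        self.heads = nn.ModuleList(
            [nn.LazyLinear(1) for _ in range(3)] + [nn.LazyLinear(n_emotions)]
        )
        # Quaternion-valued stand-in for the decoder (the paper's decoder
        # reconstructs the input spectrogram; this only shows the layer type).
        self.decoder = QuaternionConv(in_q=1, out_q=1)

    def forward(self, spec):
        z = self.encoder(spec)                        # (B, 4, F, T)
        axes = [z[:, c].flatten(1) for c in range(4)]
        preds = [head(a) for head, a in zip(self.heads, axes)]
        recon = self.decoder(z)
        return z, preds, recon

spec = torch.randn(2, 1, 64, 64)                      # batch of spectrograms
z, preds, recon = RHEmoSketch()(spec)
print(z.shape, [p.shape for p in preds], recon.shape)
```

    Training would combine the per-axis classification losses with the quaternion reconstruction loss, which is what pushes the latent channels to develop the intra-channel correlations the abstract mentions.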

    Adapting computational music similarity models to geographic user groups

    We present first results of experiments using music similarity ratings from human participants for group-specific similarity prediction. Music similarity is a key topic of research in music psychology and ethnomusicology, and computational models of music similarity have many applications, such as music recommendation and the indexing of music databases.

    This study evaluates the feasibility of adapting similarity models to location-specific subsets of similarity ratings. To this end we use information on the country where the data was provided. Apart from directly training similarity models on the localised data, we perform a gradual adaptation of a previously trained general similarity model to the location-specific data. This allows us to compare the general and localised similarity models, providing a comparative analysis of the importance of acoustic features (e.g. loudness, timbre, tempo, chroma, key) for modelling similarity judgments across user groups. In future work, such groups could be selected to yield culturally determined models.

    Our results show that localised models can be trained, but compared to general models this task proves more difficult due to the relatively small amount of training data available from each country. We found that the performance of some localised models can be increased by using a general model as a basis for training; in one case this enables an analysis of the relevance of individual features for the specific data.

    The similarity ratings used in our experiments were collected in the online Game With A Purpose "Spot The Odd Song Out". The mostly popular music presented in the game is drawn from the openly available MagnaTagATune and Million Song datasets, two large music datasets that come with acoustic descriptors for the music. In addition to the similarity data collected via triad questions, the modular game architecture allows for the collection of other human annotations, such as timbre and rhythm data. We also describe the extensible game, with a discussion of further possibilities for its application.
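    The adaptation scheme is essentially metric learning from triad ("odd one out") constraints, warm-started from a general model. Below is a minimal sketch under stated assumptions, not the authors' model: a per-feature weighted distance is trained with a hinge loss on the full rating set, then fine-tuned on one country's much smaller subset starting from the general weights; the learned weights give the kind of per-feature relevance analysis mentioned above. Feature names, the loss, learning rates, and the synthetic data are all assumptions.

```python
# Sketch: triplet-based metric learning with gradual localisation.
import numpy as np

FEATURES = ["loudness", "timbre", "tempo", "chroma", "key"]  # acoustic descriptors

def dist(w, a, b):
    """Weighted squared Euclidean distance between two feature vectors."""
    return np.sum(w * (a - b) ** 2)

def train(triplets, w, lr=0.01, epochs=50, margin=0.1):
    """Hinge-loss metric learning on triads (anchor, similar clip, odd one out)."""
    for _ in range(epochs):
        for a, p, n in triplets:
            # Violation: the odd clip is not far enough from the similar pair.
            if dist(w, a, p) + margin > dist(w, a, n):
                grad = (a - p) ** 2 - (a - n) ** 2
                w = np.maximum(w - lr * grad, 0.0)  # keep weights non-negative
    return w

def random_triplets(n, rng):
    """Synthetic stand-in for triad ratings over feature vectors."""
    return [tuple(rng.normal(size=len(FEATURES)) for _ in range(3)) for _ in range(n)]

rng = np.random.default_rng(0)
all_triads = random_triplets(200, rng)    # stand-in for the full rating set
local_triads = random_triplets(20, rng)   # one country's (much smaller) subset

general_w = train(all_triads, w=np.ones(len(FEATURES)))
# Gradual adaptation: start from the general model and fine-tune briefly
# on the localised data instead of training from scratch.
local_w = train(local_triads, w=general_w.copy(), lr=0.005, epochs=10)
print(dict(zip(FEATURES, np.round(local_w, 3))))  # per-feature relevance
```

    Warm-starting matters here because each country contributes too few triads to fit a model from scratch, which is exactly the difficulty the abstract reports for the directly trained localised models.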